Scatter plots are used to display the relationship between two continuous variables. In a scatter plot, each observation in a data set is represented by a point. Often, a scatter plot will also have a line showing the predicted values based on some statistical model. This is easy to do with R and ggplot2, and can help to make sense of data when the trends aren’t immediately obvious just by looking at it.
With large data sets, it can be problematic to plot every single observation because the points will be overplotted, obscuring one another. When this happens, you’ll probably want to summarize the data before displaying it. We’ll also see how to do that in this chapter.
You want to make a scatter plot.
Use geom_point(), and map one variable to x and one to y.
In the heightweight data set, there are a number of columns, but we’ll only use two in this example.
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
data(heightweight)
head(heightweight)
sex ageYear ageMonth heightIn weightLb
1 f 11.92 143 56.3 85.0
2 f 12.92 155 62.3 105.0
3 f 12.75 153 63.3 108.0
4 f 13.42 161 59.0 92.0
5 f 15.92 191 62.5 112.5
6 f 14.25 171 62.5 112.0
# List the two columns we'll use
ht<-heightweight[, c("ageYear", "heightIn")]
head(ht)
ageYear heightIn
1 11.92 56.3
2 12.92 62.3
3 12.75 63.3
4 13.42 59.0
5 15.92 62.5
6 14.25 62.5
ggplot(heightweight, aes(x=ageYear, y=heightIn)) + geom_point(color="red",size=1)
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
data(heightweight)
head(heightweight)
sex ageYear ageMonth heightIn weightLb
1 f 11.92 143 56.3 85.0
2 f 12.92 155 62.3 105.0
3 f 12.75 153 63.3 108.0
4 f 13.42 161 59.0 92.0
5 f 15.92 191 62.5 112.5
6 f 14.25 171 62.5 112.0
g1<-ggplot(heightweight, aes(x=ageYear, y=heightIn)) + geom_point(shape=21)
g2<-ggplot(heightweight, aes(x=ageYear, y=heightIn)) + geom_point(size=1.5)
grid.arrange(g1,g2,ncol=2)
The size of the points can be controlled with size. The default value of size is 2. The following will set size=1.5, for smaller points.
When displaying to screen or outputting to bitmap files like PNG, the default solid circle shape (#16) can result in aliased (jagged-looking) edges on some platforms. An alternative is to use shape 19, which is also a solid circle, but comes out smooth in more cases (see Figure 5-3). See Recipe 14.5 for more about anti-aliased output.
You want to group points by some variable, using shape or color.
Map the grouping variable to shape or colour. In the heightweight data set, there are many columns, but we’ll only use three of them in this example:
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
data(heightweight)
# Show the three columns we'll use
ht<-heightweight[, c("sex", "ageYear", "heightIn")]
head(ht)
sex ageYear heightIn
1 f 11.92 56.3
2 f 12.92 62.3
3 f 12.75 63.3
4 f 13.42 59.0
5 f 15.92 62.5
6 f 14.25 62.5
g1<-ggplot(heightweight, aes(x=ageYear, y=heightIn, colour=sex)) + geom_point()
ggplotly(g1)
g2<-ggplot(heightweight, aes(x=ageYear, y=heightIn, shape=sex)) + geom_point()
ggplotly(g2)
grid.arrange(g1,g2,ncol=2)
We can group points on the variable sex, by mapping sex to one of the aesthetics colour or shape.
The grouping variable must be categorical—in other words, a factor or character vector. If it is stored as a vector of numeric values, it should be converted to a factor before it is used as a grouping variable.
It is possible to map a variable to both shape and colour, or, if you have multiple grouping variables, to map different variables to them. Here, we’ll map sex to shape and colour.
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
data(heightweight)
g1<-ggplot(heightweight, aes(x=ageYear, y=heightIn, shape=sex, colour=sex)) + geom_point()
ggplotly(g1)
g2<-ggplot(heightweight, aes(x=ageYear, y=heightIn, shape=sex, colour=sex)) + geom_point() + scale_shape_manual(values=c(1,2)) + scale_colour_brewer(palette="Set1")
ggplotly(g2)
grid.arrange(g1,g2,ncol=2)
The default shapes and colors may not be very appealing. Other shapes can be used with scale_shape_manual(), and other colors can be used with scale_colour_brewer() or scale_colour_manual().
This will set different shapes and colors for the grouping variables.
You want to use point shapes that are different from the defaults.
If you want to set the shape of all the points, specify the shape in geom_point().
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
data(heightweight)
g1<-ggplot(heightweight, aes(x=ageYear, y=heightIn)) + geom_point(shape=3)
ggplotly(g1)
# Use slightly larger points and use a shape scale with custom values
g2<-ggplot(heightweight, aes(x=ageYear, y=heightIn, shape=sex)) + geom_point(size=3) + scale_shape_manual(values=c(1, 4))
ggplotly(g2)
grid.arrange(g1,g2,ncol=2)
The shapes that are available in R graphics are numbered from 1 to 25. Some of the point shapes (1–14) have just an outline, some (15–20) are solid, and some (21–25) have an outline and fill that can be controlled separately. (You can also use characters for points and these include hash tag,asterik,zero,minus,plus,percentage etc.)
For shapes 1–20, the color of the entire point—even the points that are solid—is controlled by the colour aesthetic. For shapes 21–25, the outline is controlled by colour and the fill is controlled by fill.
It’s possible to have the shape represent one variable and the fill (empty or solid) represent another variable. This is done a little indirectly, by choosing shapes that have both colour and fill, and a color palette that includes NA and another color (the NA will result in a hollow shape). For example, we’ll take the heightweight data set and add another column that indicates whether the child weighed 100 pounds or more.
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
data(heightweight)
# Make a copy of the data
hw <- heightweight
# Categorize into <100 and >=100 groups
hw$weightGroup <- cut(hw$weightLb, breaks=c(-Inf, 100, Inf),labels=c("< 100", ">= 100"))
# Use shapes with fill and color, and use colors that are empty (NA) and fill.
g1<-ggplot(hw, aes(x=ageYear, y=heightIn, shape=sex, fill=weightGroup)) + geom_point(size=2.5) +
scale_shape_manual(values=c(21, 24)) + scale_fill_manual(values=c(NA, "black"),
guide=guide_legend(override.aes=list(shape=21)))
ggplotly(g1)
You want to represent a third continuous variable using color or size.
Map the continuous variable to size or colour. In the heightweight data set, there are many columns, but we’ll only use four of them in this example.
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
data(heightweight)
# List the four columns we'll use
hw<-heightweight[, c("sex", "ageYear", "heightIn", "weightLb")]
head(hw)
sex ageYear heightIn weightLb
1 f 11.92 56.3 85.0
2 f 12.92 62.3 105.0
3 f 12.75 63.3 108.0
4 f 13.42 59.0 92.0
5 f 15.92 62.5 112.5
6 f 14.25 62.5 112.0
g1<-ggplot(heightweight, aes(x=ageYear, y=heightIn, colour=weightLb)) + geom_point()
ggplotly(g1)
g2<-ggplot(heightweight, aes(x=ageYear, y=heightIn, size=weightLb)) + geom_point()
ggplotly(g2)
grid.arrange(g1,g2,ncol=2)
The basic scatter plot above shows the relationship between the continuous variables ageYear and heightIn. To represent a third continuous variable, weightLb, we must map it to another aesthetic property. We can map it to colour or size, as shown above.
A basic scatter plot shows the relationship between two continuous variables: one mapped to the x-axis, and one to the y-axis. When there are more than two continuous variables, they must be mapped to other aesthetics: size and/or color.
We can easily perceive small differences in spatial position, so we can interpret the variables mapped to x and y coordinates with high accuracy. We aren’t very good at perceiving small differences in size and color, though, so we will interpret variables mapped to these aesthetic attributes with a much lower accuracy. When you map a variable to one of these properties, it should be one where accuracy is not very important for interpretation.
When a variable is mapped to size, the results can be perceptually misleading. Thelargest dots in Figure 5-9 have about 36 times the area of the smallest ones, but they represent only about 3.5 times the weight. If it is important for the sizes to proportionallyrepresent the quantities, you can change the range of sizes. By default the sizes of points go from 1 to 6 mm. You could reduce the range to, say, 2 to 5 mm, with scale_size_continuous(range=c(2, 5)). However, the point size numbers don’t map linearly to diameteror area, so this still won’t give a very accurate representation of the values.
When it comes to color, there are actually two aesthetic attributes that can be used: colour and fill. For most point shapes, you use colour. However, shapes 21–25 have an outline with a solid region in the middle where the color is controlled by fill. These outlined shapes can be useful when using a color scale with light colors, as in Figure 5-10, because the outline sets them off from the background. In this example, we also set the fill gradient to go from black to white and make the points larger so that the fill is easier to see.
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
data(heightweight)
g1<-ggplot(heightweight, aes(x=weightLb, y=heightIn, fill=ageYear)) + geom_point(shape=21, size=2.5) +
scale_fill_gradient(low="black", high="white")
ggplotly(g1)
# Using guide_legend() will result in a discrete legend instead of a colorbar
g2<-ggplot(heightweight, aes(x=weightLb, y=heightIn, fill=ageYear)) + geom_point(shape=21, size=2.5) + scale_fill_gradient(low="black", high="white", breaks=12:17,guide=guide_legend())
ggplotly(g2)
grid.arrange(g1,g2,ncol=2)
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
data(heightweight)
ggplot(heightweight, aes(x=ageYear, y=heightIn, size=weightLb, colour=sex)) +
geom_point(alpha=.5) +
scale_size_area() + # Make area proportional to numeric value
scale_colour_brewer(palette="Set1")
You have many points and they obscure each other.
With large data sets, the points in a scatter plot may obscure each other and prevent the viewer from accurately assessing the distribution of the data. This is called overplotting. If the amount of overplotting is low, you may be able to alleviate it by using smaller points, or by using a different shape (like shape 1, a hollow circle) through which other points can be seen. Figure 5-2 in Recipe 5.1 demonstrates both of these solutions.
If there’s a high degree of overplotting, there are a number of possible solutions:
Make the points semitransparent.
Bin the data into rectangles (better for quantitative analysis).
Bin the data into hexagons.
Use box plots.
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
data(diamonds)
sp <- ggplot(diamonds, aes(x=carat, y=price))
sp + geom_point()
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
data(diamonds)
sp <- ggplot(diamonds, aes(x=carat, y=price))
sp + geom_point(alpha=.1)
sp + geom_point(alpha=.01)
Now we can see that there are vertical bands at nice round values of carats, indicating that diamonds tend to be cut to those sizes. Still, the data is so dense that even when the points are 99% transparent, much of the graph appears solid black, and the data distribution is still somewhat obscured.
For most graphs, vector formats (such as PDF, EPS, and SVG) result in smaller output files than bitmap formats (such as TIFF and PNG). But in cases where there are tens of thousands of points, vector output files can be very large and slow to render—the scatter plot here with 99% transparent points is 1.5 MB! In these cases, high-resolution bitmaps will be smaller and faster to display on computer screens.
Another solution is to bin the points into rectangles and map the density of the points to the fill color of the rectangles, as shown in Figure 5-14. With the binned visualization, the vertical bands are barely visible. The density of points in the lower-left corner is much greater, which tells us that the vast majority of diamonds are small and inexpensive.
By default, stat_bin_2d() divides the space into 30 groups in the x and y directions, for a total of 900 bins. In the second version, we increase the number of bins with bins=50.
The default colors are somewhat difficult to distinguish because they don’t vary much in luminosity. In the second version we set the colors by using scale_fill_gradient() and specifying the low and high colors. By default, the legend doesn’t show an entry for the lowest values. This is because the range of the color scale starts not from zero, but from the smallest nonzero quantity in a bin—probably 1, in this case. To make the legend show a zero (as in Figure 5-14, right), we can manually set the range from 0 to the maximum, 6000, using limits.
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
data(diamonds)
sp <- ggplot(diamonds, aes(x=carat, y=price))
sp + stat_bin2d()
sp + stat_bin2d(bins=50) +
scale_fill_gradient(low="lightblue", high="red", limits=c(0, 6000))
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
library(hexbin)
data(diamonds)
g1<-sp + stat_binhex() + scale_fill_gradient(low="lightblue", high="red", limits=c(0, 8000))
ggplotly(g1)
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
library(hexbin)
data(diamonds)
sp <- ggplot(diamonds, aes(x=carat, y=price))
g1<-sp + stat_bin2d(bins=50) + scale_fill_gradient(low="lightblue", high="red", limits=c(0, 6000))
g2<-sp + stat_binhex() + scale_fill_gradient(low="lightblue", high="red", limits=c(0, 8000))
grid.arrange(g1,g2,ncol=2)
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
library(hexbin)
data(diamonds)
sp <- ggplot(diamonds, aes(x=carat, y=price))
sp + stat_binhex() + scale_fill_gradient(low="lightblue", high="red",
breaks=c(0, 250, 500, 1000, 2000, 4000, 6000),limits=c(0, 6000))
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
library(hexbin)
data(ChickWeight)
sp1 <- ggplot(ChickWeight, aes(x=Time, y=weight))
g1<-sp1 + geom_point()
g2<-sp1 + geom_point(position="jitter")
# Could also use geom_jitter(), which is equivalent
g3<-sp1 + geom_point(position=position_jitter(width=.5, height=0))
grid.arrange(g1,g2,g3,ncol=3)
When the data has one discrete axis and one continuous axis, it might make sense to use box plots, as shown in Figure 5-17. This will convey a different story than a standard scatter plot because it will obscure the number of data points at each location on the discrete axis. This may be problematic in some cases, but desirable in others.
With the ChickWeights data, the x-axis is conceptually discrete, but since it is stored numerically, ggplot() doesn’t know how to group the data for each box. If you don’t tell it how to group the data, you get a result like the graph on the right in Figure 5-17. To tell it how to group the data, use aes(group=…). In this case, we’ll group by each distinct value of Time.
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
library(hexbin)
data(ChickWeight)
sp1 <- ggplot(ChickWeight, aes(x=Time, y=weight))
sp1 + geom_boxplot(aes(group=Time))
You want to add lines from a fitted regression model to a scatter plot.
To add a linear regression line to a scatter plot, add stat_smooth() or geom_smooth() and tell it to use method=lm. This instructs it to fit the data with the lm() (linear model) function. First we’ll save the base plot object in sp, then we’ll add different components to it.
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
library(hexbin)
data(heightweight)
# The base plot
sp <- ggplot(heightweight, aes(x=ageYear, y=heightIn))
g1<-sp + geom_point() + stat_smooth(method=lm)
ggplotly(g1)
g2<-sp + geom_point() + geom_smooth(method=lm)
ggplotly(g2)
# The two codes produce the same plot
grid.arrange(g1,g2,ncol=2)
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
library(hexbin)
data(heightweight)
# The base plot
sp <- ggplot(heightweight, aes(x=ageYear, y=heightIn))
# 99% confidence region
g1<-sp + geom_point() + stat_smooth(method=lm, level=0.99)
ggplotly(g1)
# No confidence region
g2<-sp + geom_point() + stat_smooth(method=lm, se=FALSE)
ggplotly(g2)
grid.arrange(g1,g2,ncol=2)
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
library(hexbin)
data(heightweight)
# The base plot
sp <- ggplot(heightweight, aes(x=ageYear, y=heightIn))
g1<-sp + geom_point(colour="red",size=1) + stat_smooth(method=lm, se=FALSE, colour="black")
ggplotly(g1)
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
library(hexbin)
data(heightweight)
# The base plot
sp <- ggplot(heightweight, aes(x=ageYear, y=heightIn))
g1<-sp + geom_point(colour="grey60") + stat_smooth()
ggplotly(g1)
g2<-sp + geom_point(colour="grey60") + stat_smooth(method=loess)
ggplotly(g2)
grid.arrange(g1,g2,ncol=2)
Another common type of model fit is a logistic regression. Logistic regression isn’t appropriate for the heightweight data set, but it’s perfect for the biopsy data set in the MASS library. In this data set, there are nine different measured attributes of breast cancer biopsies, as well as the class of the tumor, which is either benign or malignant. To prepare the data for logistic regression, we must convert the factor class, with the levels benign and malignant, to a vector with numeric values of 0 and 1. We’ll make a copy of the biopsy data frame, then store the numeric coded class in a column called classn.
Although there are many attributes we could examine, for this example we’ll just look at the relationship of V1 (clump thickness) and the class of the tumor. Because there is a large degree of overplotting, we’ll jitter the points and make them semitransparent (alpha=0.4), hollow (shape=21), and slightly smaller (size=1.5). Then we’ll add a fitted logistic regression line (Figure 5-20) by telling stat_smooth() to use the glm() function with the option family=binomial.
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
library(hexbin)
library(MASS) # for the data set
b <- biopsy
b$classn[b$class=="benign"] <- 0
b$classn[b$class=="malignant"] <- 1
g1<-ggplot(b, aes(x=V1, y=classn)) + geom_point(position=position_jitter(width=0.3, height=0.06), alpha=0.4, shape=21, size=1.5) + stat_smooth(method=glm, family=binomial)
ggplotly(g1)
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
library(hexbin)
data(heightweight)
sps <- ggplot(heightweight, aes(x=ageYear, y=heightIn, colour=sex)) + geom_point() + scale_colour_brewer(palette="Set1")
g1<-sps + geom_smooth()
ggplotly(g1)
g2<-sps + geom_smooth(method="lm",se=FALSE)
ggplotly(g2)
grid.arrange(g1,g2,ncol=2)
Notice that the blue line, for males, doesn’t run all the way to the right side of the graph. There are two reasons for this. The first is that, by default, stat_smooth() limits the prediction to within the range of the predictor data (on the x-axis). The second is that even if it extrapolates, the loess() function only offers prediction within the x range of the data.
If you want the lines to extrapolate from the data, as shown in the right-hand image of Figure 5-21, you must use a model method that allows extrapolation, like lm(), and pass stat_smooth() the option fullrange=TRUE:
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
library(hexbin)
data(heightweight)
sps <- ggplot(heightweight, aes(x=ageYear, y=heightIn, colour=sex)) + geom_point() + scale_colour_brewer(palette="Set1")
sps + geom_smooth(method=lm, se=FALSE, fullrange=TRUE)
You have already created a fitted regression model object for a data set, and you want to plot the lines for that model.
Usually the easiest way to overlay a fitted model is to simply ask stat_smooth() to do it for you, as described in Recipe 5.6. Sometimes, however, you may want to create the model yourself and then add it to your graph. This allows you to be sure that the model you’re using for other calculations is the same one that you see.
In this example, we’ll build a quadratic model using lm() with ageYear as a predictor of heightIn. Then we’ll use the predict() function and find the predicted values of heightIn across the range of values for the predictor, ageYear:
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
library(hexbin)
data(heightweight)
model <- lm(heightIn ~ ageYear + I(ageYear^2), heightweight)
model
Call:
lm(formula = heightIn ~ ageYear + I(ageYear^2), data = heightweight)
Coefficients:
(Intercept) ageYear I(ageYear^2)
-10.3136 8.6673 -0.2478
# Create a data frame with ageYear column, interpolating across range
xmin <- min(heightweight$ageYear)
xmax <- max(heightweight$ageYear)
predicted <- data.frame(ageYear=seq(xmin, xmax, length.out=100))
# Calculate predicted values of heightIn
predicted$heightIn <- predict(model, predicted)
head(predicted)
ageYear heightIn
1 11.58000 56.82624
2 11.63980 57.00047
3 11.69960 57.17294
4 11.75939 57.34363
5 11.81919 57.51255
6 11.87899 57.67969
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
library(hexbin)
data(heightweight)
model <- lm(heightIn ~ ageYear + I(ageYear^2), heightweight)
# Create a data frame with ageYear column, interpolating across range
xmin <- min(heightweight$ageYear)
xmax <- max(heightweight$ageYear)
predicted <- data.frame(ageYear=seq(xmin, xmax, length.out=100))
# Calculate predicted values of heightIn
predicted$heightIn <- predict(model, predicted)
head(predicted)
ageYear heightIn
1 11.58000 56.82624
2 11.63980 57.00047
3 11.69960 57.17294
4 11.75939 57.34363
5 11.81919 57.51255
6 11.87899 57.67969
sp <- ggplot(heightweight, aes(x=ageYear, y=heightIn)) +
geom_point(colour="grey40")
sp + geom_line(data=predicted, size=1)
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
library(hexbin)
data(heightweight)
sp <- ggplot(heightweight, aes(x=ageYear, y=heightIn)) + geom_point(colour="grey40")
predictvals <- function(model, xvar, yvar, xrange=NULL, samples=100, ...) {
# If xrange isn't passed in, determine xrange from the models.
# Different ways of extracting the x range, depending on model type
if (is.null(xrange)) {
if (any(class(model) %in% c("lm", "glm")))
xrange <- range(model$model[[xvar]])
else if (any(class(model) %in% "loess"))
xrange <- range(model$x)
}
newdata <- data.frame(x = seq(xrange[1], xrange[2], length.out = samples))
names(newdata) <- xvar
newdata[[yvar]] <- predict(model, newdata = newdata, ...)
newdata
}
modlinear <- lm(heightIn ~ ageYear, heightweight)
modloess <- loess(heightIn ~ ageYear, heightweight)
#Then we can call predictvals() on each model, and pass the resulting data frames to geom_line()
lm_predicted <- predictvals(modlinear, "ageYear", "heightIn")
loess_predicted <- predictvals(modloess, "ageYear", "heightIn")
sp + geom_line(data=lm_predicted, colour="red", size=.8) + geom_line(data=loess_predicted, colour="blue", size=.8)
For glm models that use a nonlinear link function, you need to specify type=“response” to the predictvals() function. This is because the default behavior is to return predicted values in the scale of the linear predictors, instead of in the scale of the response (y) variable.
To illustrate this, we’ll use the biopsy data set from the MASS library. As we did in Recipe 5.6, we’ll use V1 to predict class. Since logistic regression uses values from 0 to 1, while class is a factor, we’ll first have to convert class to 0s and 1s:
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
library(hexbin)
library(MASS) # For the data set
b <- biopsy
b$classn[b$class=="benign"] <- 0
b$classn[b$class=="malignant"] <- 1
#Next, we’ll perform the logistic regression:
fitlogistic <- glm(classn ~ V1, b, family=binomial)
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
library(hexbin)
library(MASS) # For the data set
b <- biopsy
b$classn[b$class=="benign"] <- 0
b$classn[b$class=="malignant"] <- 1
#Next, we’ll perform the logistic regression:
fitlogistic <- glm(classn ~ V1, b, family=binomial)
# Get predicted values
glm_predicted <- predictvals(fitlogistic, "V1", "classn", type="response")
ggplot(b, aes(x=V1, y=classn)) +
geom_point(position=position_jitter(width=.3, height=.08), alpha=0.4,
shape=21, size=1.5) + geom_line(data=glm_predicted, colour="#1177FF", size=1)
You have already created a fitted regression model object for a data set, and you want to plot the lines for that model.
Use the predictvals() function from the previous recipe along with dlply() and ldply() from the plyr package.
With the heightweight data set, we’ll make a linear model with lm() for each of the levels of sex, and put those model objects in a list. The model building is done with a function, make_model(), defined here. If you pass it a data frame, it simply returns an lm object. The model can be customized for your data.
With the function below, we can use the dlply() function to build a model for each subset of data. This will split the data frame into subsets by the grouping variable sex, and apply make_model() to each subset. In this case, the heightweight data will be split into two data frames, one for males and one for females, and make_model() will be run on each subset. With dlply(), the models are put into a list and the list is returned.
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
library(hexbin)
data(heightweight)
make_model <- function(data) {
lm(heightIn ~ ageYear, data)
}
library(plyr)
models <- dlply(heightweight, "sex", .fun = make_model)
# Print out the list of two lm objects, f and m
models
$f
Call:
lm(formula = heightIn ~ ageYear, data = data)
Coefficients:
(Intercept) ageYear
43.963 1.209
$m
Call:
lm(formula = heightIn ~ ageYear, data = data)
Coefficients:
(Intercept) ageYear
30.658 2.301
attr(,"split_type")
[1] "data.frame"
attr(,"split_labels")
sex
1 f
2 m
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
library(hexbin)
data(heightweight)
make_model <- function(data) {
lm(heightIn ~ ageYear, data)
}
library(plyr)
models <- dlply(heightweight, "sex", .fun = make_model)
predictvals <- function(model, xvar, yvar, xrange=NULL, samples=100, ...) {
# If xrange isn't passed in, determine xrange from the models.
# Different ways of extracting the x range, depending on model type
if (is.null(xrange)) {
if (any(class(model) %in% c("lm", "glm")))
xrange <- range(model$model[[xvar]])
else if (any(class(model) %in% "loess"))
xrange <- range(model$x)
}
newdata <- data.frame(x = seq(xrange[1], xrange[2], length.out = samples))
names(newdata) <- xvar
newdata[[yvar]] <- predict(model, newdata = newdata, ...)
newdata
}
predvals <- ldply(models, .fun=predictvals, xvar="ageYear", yvar="heightIn")
head(predvals)
sex ageYear heightIn
1 f 11.58000 57.96250
2 f 11.63980 58.03478
3 f 11.69960 58.10707
4 f 11.75939 58.17936
5 f 11.81919 58.25165
6 f 11.87899 58.32394
#Finally, we can plot the data with the predicted values
ggplot(heightweight, aes(x=ageYear, y=heightIn, colour=sex)) +
geom_point() + geom_line(data=predvals)
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
library(hexbin)
data(heightweight)
make_model <- function(data) {
lm(heightIn ~ ageYear, data)
}
library(plyr)
models <- dlply(heightweight, "sex", .fun = make_model)
predictvals <- function(model, xvar, yvar, xrange=NULL, samples=100, ...) {
# If xrange isn't passed in, determine xrange from the models.
# Different ways of extracting the x range, depending on model type
if (is.null(xrange)) {
if (any(class(model) %in% c("lm", "glm")))
xrange <- range(model$model[[xvar]])
else if (any(class(model) %in% "loess"))
xrange <- range(model$x)
}
newdata <- data.frame(x = seq(xrange[1], xrange[2], length.out = samples))
names(newdata) <- xvar
newdata[[yvar]] <- predict(model, newdata = newdata, ...)
newdata
}
predvals <- ldply(models, .fun=predictvals, xvar="ageYear", yvar="heightIn",
xrange=range(heightweight$ageYear))
#Then we can plot it, the same as we did before:
ggplot(heightweight, aes(x=ageYear, y=heightIn, colour=sex)) +
geom_point() + geom_line(data=predvals)
You want to add numerical information about a model to a plot.
To add simple text to a plot, simply add an annotation. In this example, we’ll create a linear model and use the predictvals() function defined in Recipe 5.7 to create a prediction line from the model. Then we’ll add an annotation.
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
library(hexbin)
data(heightweight)
model <- lm(heightIn ~ ageYear, heightweight)
summary(model)
Call:
lm(formula = heightIn ~ ageYear, data = heightweight)
Residuals:
Min 1Q Median 3Q Max
-8.3517 -1.9006 0.1378 1.9071 8.3371
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.4356 1.8281 20.48 <2e-16 ***
ageYear 1.7483 0.1329 13.15 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.989 on 234 degrees of freedom
Multiple R-squared: 0.4249, Adjusted R-squared: 0.4225
F-statistic: 172.9 on 1 and 234 DF, p-value: < 2.2e-16
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
library(hexbin)
data(heightweight)
# First generate prediction data
pred <- predictvals(model, "ageYear", "heightIn")
sp <- ggplot(heightweight, aes(x=ageYear, y=heightIn)) + geom_point() + geom_line(data=pred)
sp + annotate("text", label="r^2=0.42", x=16.5, y=52)
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
library(hexbin)
data(heightweight)
sp + annotate("text", label="r^2 == 0.42", parse = TRUE, x=16.5, y=52)
Text geoms in ggplot2 do not take expression objects directly; instead, they take character strings that are turned into expressions with parse(text=“a + b”).
If you use a math expression, the syntax must be correct for it to be a valid R expression object. You can test validity by wrapping it in expression() and seeing if it throws an error (make sure not to use quotes around the expression). In the example here, == is a valid construct in an expression to express equality, but = is not.
expression(r^2 == 0.42). This is valid.
expression(r^2 = 0.42). This is not valid.
It’s possible to automatically extract values from the model object and build an expression using those values. In this example, we’ll create a string that, when parsed, returns a valid expression:
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
library(hexbin)
data(heightweight)
eqn <- as.character(as.expression(
substitute(italic(y) == a + b * italic(x) * "," ~~ italic(r)^2 ~ "=" ~ r2,
list(a = format(coef(model)[1], digits=3),
b = format(coef(model)[2], digits=3),
r2 = format(summary(model)$r.squared, digits=2)
))))
eqn
[1] "italic(y) == c(`(Intercept)` = \"37.4\") + c(ageYear = \"1.75\") * italic(x) * \",\" ~ ~italic(r)^2 ~ \"=\" ~ \"0.42\""
parse(text=eqn) # Parsing turns it into an expression
expression(italic(y) == c(`(Intercept)` = "37.4") + c(ageYear = "1.75") *
italic(x) * "," ~ ~italic(r)^2 ~ "=" ~ "0.42")
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
library(hexbin)
data(heightweight)
sp <- ggplot(heightweight, aes(x=ageYear, y=heightIn)) + geom_point() + geom_line(data=pred)
sp + annotate("text", label=eqn, parse=TRUE, x=Inf, y=-Inf, hjust=1.1, vjust=-.5)
You want to add marginal rugs to a scatter plot.
Use geom_rug(). For this example (Figure 5-28), we’ll use the faithful data set, which contains data about the Old Faithful geyser in two columns—eruptions, which is the length of each eruption, and waiting, which is the length of time to the next eruption.
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
library(hexbin)
data(faithful)
ggplot(faithful, aes(x=eruptions, y=waiting)) + geom_point() + geom_rug()
A marginal rug plot is essentially a one-dimensional scatter plot that can be used to visualize the distribution of data on each axis.
In this particular data set, the marginal rug is not as informative as it could be. The resolution of the waiting variable is in whole minutes, and because of this, the rug lines have a lot of overplotting. To reduce the overplotting, we can jitter the line positions and make them slightly thinner by specifying size (Figure 5-29). This helps the viewer see the distribution more clearly:
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
library(hexbin)
data(faithful)
ggplot(faithful, aes(x=eruptions, y=waiting)) + geom_point() +
geom_rug(position="jitter", size=.2)
You want to add labels to points in a scatter plot.
For annotating just one or a few points, you can use annotate() or geom_text(). For this example, we’ll use the countries data set and visualize the relationship between health expenditures and infant mortality rate per 1,000 live births. To keep things manageable, we’ll just take the subset of countries that spent more than $2000 USD per capita.
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
library(hexbin)
data(countries)
subset(countries, Year==2009 & healthexp>2000)
Name Code Year GDP laborrate healthexp infmortality
254 Andorra AND 2009 NA NA 3089.636 3.1
560 Australia AUS 2009 42130.82 65.2 3867.429 4.2
611 Austria AUT 2009 45555.43 60.4 5037.311 3.6
968 Belgium BEL 2009 43640.20 53.5 5104.019 3.6
1733 Canada CAN 2009 39599.04 67.8 4379.761 5.2
2702 Denmark DNK 2009 55933.35 65.4 6272.729 3.4
3365 Finland FIN 2009 44576.73 60.9 4309.604 2.5
3416 France FRA 2009 40663.05 56.1 4797.966 3.5
3671 Germany DEU 2009 40658.58 59.8 4628.769 3.5
3824 Greece GRC 2009 28936.48 53.7 3040.734 3.5
4436 Iceland ISL 2009 37972.24 77.5 3130.391 1.7
4691 Ireland IRL 2009 49737.93 63.6 4951.845 3.4
4844 Italy ITA 2009 35073.32 49.1 3327.630 3.2
4946 Japan JPN 2009 39456.44 59.5 3321.466 2.4
5864 Luxembourg LUX 2009 106252.24 55.5 8182.855 2.2
6680 Monaco MCO 2009 172676.34 NA 7137.390 3.4
7088 Netherlands NLD 2009 48068.35 66.1 5163.740 3.8
7190 New Zealand NZL 2009 29352.45 68.6 2633.625 4.9
7445 Norway NOR 2009 78408.71 66.9 7661.610 2.9
7955 Portugal PRT 2009 22029.87 62.5 2409.661 3.2
8312 San Marino SMR 2009 NA NA 4089.271 1.9
8822 Slovenia SVN 2009 24101.26 58.9 2174.718 2.5
9077 Spain ESP 2009 31891.39 58.6 3075.013 4.0
9536 Sweden SWE 2009 43406.17 64.8 4251.976 2.4
9587 Switzerland CHE 2009 63524.65 66.9 7140.729 4.1
10454 United Kingdom GBR 2009 35163.41 62.2 3285.050 4.7
10505 United States USA 2009 45744.56 65.0 7410.163 6.6
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
library(hexbin)
data(countries)
sp <- ggplot(subset(countries, Year==2009 & healthexp>2000),
aes(x=healthexp, y=infmortality)) + geom_point()
g1<-sp + annotate("text", x=4350, y=5.4, label="Canada") +
annotate("text", x=7400, y=6.8, label="USA")
ggplotly(g1)
g2<-sp + geom_text(aes(label=Name), size=4)
ggplotly(g2)
grid.arrange(g1,g2,ncol=2)
To automatically add the labels from your data (Figure 5-30, right), use geom_text() and map a column that is a factor or character vector to the label aesthetic. In this case, we’ll use Name, and we’ll make the font slightly smaller to reduce crowding. The default value for size is 5, which doesn’t correspond directly to a point size:
The automatic method for placing annotations centers each annotation on the x and y coordinates. You’ll probably want to shift the text vertically, horizontally, or both. Setting vjust=0 will make the baseline of the text on the same level as the point (Figure 5-31, left), and setting vjust=1 will make the top of the text level with the point. This usually isn’t enough, though—you can either increase or decrease vjust to shift the labels higher or lower, or you can add or subtract a bit to or from the y mapping to get the same effect (Figure 5-31, right):
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
library(hexbin)
data(countries)
sp <- ggplot(subset(countries, Year==2009 & healthexp>2000),
aes(x=healthexp, y=infmortality)) + geom_point()
g1<-sp + geom_text(aes(label=Name), size=4, vjust=0)
# Add a little extra to y
g2<-sp + geom_text(aes(y=infmortality+.1, label=Name), size=4, vjust=0)
grid.arrange(g1,g2,ncol=2)
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
library(hexbin)
data(countries)
sp <- ggplot(subset(countries, Year==2009 & healthexp>2000),
aes(x=healthexp, y=infmortality)) + geom_point()
sp + geom_text(aes(label=Name), size=4, hjust=0)
sp + geom_text(aes(x=healthexp+100, label=Name), size=4, hjust=0)
If you are using a logarithmic axis, instead of adding to x or y, you’ll need to multiply the x or y value by a number to shift the labels a consistent amount.
If you want to label just some of the points but want the placement to be handled automatically, you can add a new column to your data frame containing just the labels you want. Here’s one way to do that: first we’ll make a copy of the data we’re using, then we’ll duplicate the Name column into Name1:
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
library(hexbin)
data(countries)
sp <- ggplot(subset(countries, Year==2009 & healthexp>2000),
aes(x=healthexp, y=infmortality)) + geom_point()
cdat <- subset(countries, Year==2009 & healthexp>2000)
cdat$Name1 <- cdat$Name
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
library(hexbin)
data(countries)
idx <- cdat$Name1 %in% c("Canada", "Ireland", "United Kingdom", "United States","New Zealand", "Iceland", "Japan", "Luxembourg","Netherlands", "Switzerland")
idx
[1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
[13] FALSE TRUE TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[25] TRUE TRUE TRUE
#Then we’ll use that Boolean vector to overwrite all the other entries in Name1 with NA.
cdat$Name1[!idx] <- NA
#This is what the result looks like
cdat
Name Code Year GDP laborrate healthexp infmortality
254 Andorra AND 2009 NA NA 3089.636 3.1
560 Australia AUS 2009 42130.82 65.2 3867.429 4.2
611 Austria AUT 2009 45555.43 60.4 5037.311 3.6
968 Belgium BEL 2009 43640.20 53.5 5104.019 3.6
1733 Canada CAN 2009 39599.04 67.8 4379.761 5.2
2702 Denmark DNK 2009 55933.35 65.4 6272.729 3.4
3365 Finland FIN 2009 44576.73 60.9 4309.604 2.5
3416 France FRA 2009 40663.05 56.1 4797.966 3.5
3671 Germany DEU 2009 40658.58 59.8 4628.769 3.5
3824 Greece GRC 2009 28936.48 53.7 3040.734 3.5
4436 Iceland ISL 2009 37972.24 77.5 3130.391 1.7
4691 Ireland IRL 2009 49737.93 63.6 4951.845 3.4
4844 Italy ITA 2009 35073.32 49.1 3327.630 3.2
4946 Japan JPN 2009 39456.44 59.5 3321.466 2.4
5864 Luxembourg LUX 2009 106252.24 55.5 8182.855 2.2
6680 Monaco MCO 2009 172676.34 NA 7137.390 3.4
7088 Netherlands NLD 2009 48068.35 66.1 5163.740 3.8
7190 New Zealand NZL 2009 29352.45 68.6 2633.625 4.9
7445 Norway NOR 2009 78408.71 66.9 7661.610 2.9
7955 Portugal PRT 2009 22029.87 62.5 2409.661 3.2
8312 San Marino SMR 2009 NA NA 4089.271 1.9
8822 Slovenia SVN 2009 24101.26 58.9 2174.718 2.5
9077 Spain ESP 2009 31891.39 58.6 3075.013 4.0
9536 Sweden SWE 2009 43406.17 64.8 4251.976 2.4
9587 Switzerland CHE 2009 63524.65 66.9 7140.729 4.1
10454 United Kingdom GBR 2009 35163.41 62.2 3285.050 4.7
10505 United States USA 2009 45744.56 65.0 7410.163 6.6
Name1
254 <NA>
560 <NA>
611 <NA>
968 <NA>
1733 Canada
2702 <NA>
3365 <NA>
3416 <NA>
3671 <NA>
3824 <NA>
4436 Iceland
4691 Ireland
4844 <NA>
4946 Japan
5864 Luxembourg
6680 <NA>
7088 Netherlands
7190 New Zealand
7445 <NA>
7955 <NA>
8312 <NA>
8822 <NA>
9077 <NA>
9536 <NA>
9587 Switzerland
10454 United Kingdom
10505 United States
#Now we can make the plot (Figure 5-33). This time, we’ll also expand the x range so that the text will fit
ggplot(cdat, aes(x=healthexp, y=infmortality)) +geom_point() + geom_text(aes(x=healthexp+100, label=Name1), size=4, hjust=0) + xlim(2000, 10000)
You want to make a balloon plot, where the area of the dots is proportional to their numerical value.
Use geom_point() with scale_size_area(). For this example, we’ll use a subset of the countries data set.
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
library(hexbin)
data(countries)
cdat <- subset(countries, Year==2009 &
Name %in% c("Canada", "Ireland", "United Kingdom", "United States",
"New Zealand", "Iceland", "Japan", "Luxembourg",
"Netherlands", "Switzerland"))
cdat
Name Code Year GDP laborrate healthexp infmortality
1733 Canada CAN 2009 39599.04 67.8 4379.761 5.2
4436 Iceland ISL 2009 37972.24 77.5 3130.391 1.7
4691 Ireland IRL 2009 49737.93 63.6 4951.845 3.4
4946 Japan JPN 2009 39456.44 59.5 3321.466 2.4
5864 Luxembourg LUX 2009 106252.24 55.5 8182.855 2.2
7088 Netherlands NLD 2009 48068.35 66.1 5163.740 3.8
7190 New Zealand NZL 2009 29352.45 68.6 2633.625 4.9
9587 Switzerland CHE 2009 63524.65 66.9 7140.729 4.1
10454 United Kingdom GBR 2009 35163.41 62.2 3285.050 4.7
10505 United States USA 2009 45744.56 65.0 7410.163 6.6
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
library(hexbin)
data(countries)
p <- ggplot(cdat, aes(x=healthexp, y=infmortality, size=GDP)) +
geom_point(shape=21, colour="black", fill="cornsilk")
# GDP mapped to radius (default with scale_size_continuous)
p
# GDP mapped to area instead, and larger circles
g<-p + scale_size_area(max_size=15)
g
grid.arrange(p,g,ncol=2)
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
library(hexbin)
data(countries)
# Add up counts for male and female
hec <- HairEyeColor[,,"Male"] + HairEyeColor[,,"Female"]
# Convert to long format
library(reshape2)
hec <- melt(hec, value.name="count")
ggplot(hec, aes(x=Eye, y=Hair)) +
geom_point(aes(size=count), shape=21, colour="black", fill="cornsilk") +
scale_size_area(max_size=20, guide=FALSE) +
geom_text(aes(y=as.numeric(Hair)-sqrt(count)/22, label=count), vjust=1,
colour="grey60", size=4)
In this example we’ve used a few tricks to add the text labels under the circles. First, we used vjust=1 to top-justify the text to the y coordinate. Next, we wanted to set the y coordinate so that it is just underneath the bottom of each circle. This requires a little arithmetic: take the numeric value of Hair and subtract a small value from it, where the value depends in some way on count. This actually requires taking the square root of count, since the radius has a linear relationship with the square root of count. The number that this value divided by (22 in this case) is found by trial and error; it depends on the particular data values, radius, and text size.
The text under the circles is in a shade of grey. This is so that it doesn’t jump out at the viewer and overwhelm the perceptual impact of the circles, but is still available if the viewer wants to know the exact values.
-A scatter plot matrix is an excellent way of visualizing the pairwise relationships among several variables. To make one, use the pairs() function from R’s base graphics. For this example, we’ll use a subset of the countries data set. We’ll pull out the data for the year 2009, and keep only the columns that are relevant:
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
library(hexbin)
data(countries)
c2009 <- subset(countries, Year==2009,
select=c(Name, GDP, laborrate, healthexp, infmortality))
c2009
Name GDP laborrate healthexp
50 Afghanistan NA 59.8 50.88597
101 Albania 3772.6047 59.5 264.60406
152 Algeria 4022.1989 58.5 267.94653
203 American Samoa NA NA NA
254 Andorra NA NA 3089.63589
305 Angola 4068.5757 81.3 203.80787
356 Antigua and Barbuda 12501.8666 NA 652.53169
407 Argentina 7665.0734 65.0 730.17301
458 Armenia 2768.6126 66.3 128.85761
509 Aruba NA NA NA
560 Australia 42130.8203 65.2 3867.42853
611 Austria 45555.4345 60.4 5037.31089
662 Azerbaijan 4808.1688 63.0 284.72528
713 Bahamas, The 20916.2842 73.3 1557.65512
764 Bahrain 17608.8298 63.8 1107.90631
815 Bangladesh 607.7649 70.7 18.43125
866 Barbados 13181.3416 71.7 1041.44095
917 Belarus 5182.6304 60.2 294.61477
968 Belgium 43640.1962 53.5 5104.01899
1019 Belize 4056.1224 64.1 217.15214
1070 Benin 771.7088 72.7 31.92885
1121 Bermuda 88746.8944 NA NA
1172 Bhutan 1772.2838 62.7 97.98912
1223 Bolivia 1774.1952 71.9 84.78763
1274 Bosnia and Herzegovina 4523.3114 61.2 494.87951
1325 Botswana 5790.1819 76.6 611.92342
1376 Brazil 8251.0616 70.7 734.05138
1427 Brunei Darussalam 27390.0500 67.5 790.72871
1478 Bulgaria 6403.1477 54.5 474.84637
1529 Burkina Faso 509.2978 84.4 38.07546
1580 Burundi 162.8704 89.3 19.78891
1631 Cambodia 748.1512 79.3 42.10335
1682 Cameroon 1157.0245 67.0 61.05385
1733 Canada 39599.0418 67.8 4379.76084
1784 Cape Verde 3227.9520 66.4 146.10513
1835 Cayman Islands NA NA NA
1886 Central African Republic 458.5672 79.0 19.33792
1937 Chad 625.3019 70.4 41.78443
1988 Channel Islands NA NA NA
2039 Chile 9487.0111 57.3 787.19332
2090 China 3748.9344 73.7 177.14653
2141 Colombia 5165.7319 58.6 323.01081
2192 Comoros 747.9125 79.6 27.83652
2243 Congo, Dem. Rep. 174.5076 70.8 15.58425
2294 Congo, Rep. 2430.5255 72.7 70.07210
2345 Costa Rica 6369.1663 62.7 668.49256
2396 Cote d'Ivoire 1190.7895 66.9 55.25489
2447 Croatia 14322.6081 53.0 1120.37109
2498 Cuba NA 53.9 707.39670
2549 Curacao NA NA NA
2600 Cyprus 31298.4375 62.3 NA
2651 Czech Republic 18136.8451 57.9 1383.65144
2702 Denmark 55933.3545 65.4 6272.72868
2753 Djibouti 1202.9199 70.1 84.45174
2804 Dominica 5532.0537 NA 361.29188
2855 Dominican Republic 4756.3591 65.1 270.64760
2906 Ecuador 3647.6990 62.3 255.49979
2957 Egypt, Arab Rep. 2370.7111 48.8 113.29717
3008 El Salvador 3425.1706 59.9 228.57189
3059 Equatorial Guinea 17944.4045 65.5 709.40713
3110 Eritrea 364.2048 72.6 10.12132
3161 Estonia 14238.8817 61.2 1004.42550
3212 Ethiopia 393.6832 85.4 14.68076
3263 Faeroe Islands 45205.9305 NA NA
3314 Fiji 3314.2707 58.7 130.41002
3365 Finland 44576.7270 60.9 4309.60396
3416 France 40663.0502 56.1 4797.96556
3467 French Polynesia NA 57.3 NA
3518 Gabon 7411.1839 75.5 266.26454
3569 Gambia, The 436.1472 77.8 25.62880
3620 Georgia 2441.0105 63.7 255.62737
3671 Germany 40658.5823 59.8 4628.76920
3722 Ghana 1098.4257 74.6 45.05245
3773 Gibraltar NA NA NA
3824 Greece 28936.4809 53.7 3040.73383
3875 Greenland 22507.8887 NA NA
3926 Grenada 5906.4568 NA 446.52663
3977 Guam NA 64.4 NA
4028 Guatemala 2684.9664 66.9 186.12313
4079 Guinea 426.6530 84.2 18.77023
4130 Guinea-Bissau 562.4150 71.5 18.35889
4181 Guyana 2689.8559 63.5 132.50867
4232 Haiti 656.7792 69.9 39.60249
4283 Honduras 1921.8795 59.9 117.06353
4334 Hong Kong SAR, China 29881.8144 60.0 NA
4385 Hungary 12847.3031 50.1 937.98617
4436 Iceland 37972.2370 77.5 3130.39086
4487 India 1195.0003 57.6 44.80258
4538 Indonesia 2271.7753 68.9 55.44365
4589 Iran, Islamic Rep. 4525.9486 52.8 268.75470
4640 Iraq 2096.8511 41.3 98.45406
4691 Ireland 49737.9274 63.6 4951.84469
4742 Isle of Man NA NA NA
4793 Israel 26102.3506 57.1 1966.47189
4844 Italy 35073.3225 49.1 3327.62987
4895 Jamaica 4693.4275 64.7 231.08176
4946 Japan 39456.4388 59.5 3321.46639
4997 Jordan 4242.1537 49.3 335.81544
5048 Kazakhstan 7240.5703 70.6 330.11134
5099 Kenya 744.4031 82.2 33.24912
5150 Kiribati 1305.8038 NA 158.50643
5201 Korea, Dem. Rep. NA 66.0 NA
5252 Korea, Rep. 17109.9851 60.9 1107.94833
5303 Kosovo 3010.9934 NA NA
5354 Kuwait 41364.6893 68.5 1416.09990
5405 Kyrgyz Republic 881.3605 66.6 57.09366
5456 Lao PDR 997.2401 78.3 35.82477
5507 Latvia 11475.6923 61.5 750.44926
5558 Lebanon 8321.3707 46.1 663.27358
5609 Lesotho 800.4202 74.0 70.04993
5660 Liberia 229.2703 71.1 29.35613
5711 Libya 9957.4904 52.8 416.74524
5762 Liechtenstein 134914.6728 NA NA
5813 Lithuania 11033.5885 55.7 729.78492
5864 Luxembourg 106252.2442 55.5 8182.85511
5915 Macao SAR, China 40919.3262 69.4 NA
5966 Macedonia, FYR 4510.2380 54.0 313.68971
6017 Madagascar 421.7802 86.4 17.97023
6068 Malawi 327.3363 76.8 19.06556
6119 Malaysia 6908.6611 62.0 336.43858
6170 Maldives 4230.0510 67.1 330.69865
6221 Mali 601.2609 51.9 38.42789
6272 Malta 19326.4588 49.4 1446.43275
6323 Marshall Islands 2861.6376 NA 422.49589
6374 Mauritania 896.1959 70.0 21.91725
6425 Mauritius 6951.2787 57.5 383.05315
6476 Mayotte NA NA NA
6527 Mexico 7879.6773 61.4 514.80196
6578 Micronesia, Fed. Sts. 2498.2833 NA 336.91627
6629 Moldova 1525.5315 49.6 180.89118
6680 Monaco 172676.3407 NA 7137.38972
6731 Mongolia 1690.4170 72.9 74.19826
6782 Montenegro 6569.0869 NA 616.83557
6833 Morocco 2842.3235 52.3 155.67512
6884 Mozambique 428.1975 85.8 24.72483
6935 Myanmar NA 73.8 12.47223
6986 Namibia 4095.5044 57.1 257.96755
7037 Nepal 438.1784 71.5 25.34454
7088 Netherlands 48068.3540 66.1 5163.74038
7139 New Caledonia NA 57.0 NA
7190 New Zealand 29352.4540 68.6 2633.62461
7241 Nicaragua 1088.1658 62.4 104.69487
7292 Niger 351.2742 62.7 20.88920
7343 Nigeria 1091.1344 56.2 69.29737
7394 Northern Mariana Islands NA NA NA
7445 Norway 78408.7138 66.9 7661.60977
7496 Oman 17280.0972 55.7 496.64996
7547 Pakistan 950.1192 54.3 22.55684
7598 Palau 8094.5632 NA 980.54244
7649 Panama 6955.7448 64.6 590.72387
7700 Papua New Guinea 1180.6904 72.9 36.69268
7751 Paraguay 2245.3284 71.9 158.85965
7802 Peru 4412.3903 67.1 200.78948
7853 Philippines 1835.6365 63.8 66.88271
7904 Poland 11287.7389 53.7 803.64614
7955 Portugal 22029.8691 62.5 2409.66063
8006 Puerto Rico NA 46.1 NA
8057 Qatar 61531.6921 84.3 1714.99341
8108 Romania 7500.3404 52.4 408.32754
8159 Russian Federation 8614.6729 62.8 475.24572
8210 Rwanda 510.3116 86.0 48.18416
8261 Samoa 2721.9439 57.5 205.00032
8312 San Marino NA NA 4089.27133
8363 Sao Tome and Principe 1168.8878 59.8 90.73473
8414 Saudi Arabia 13900.6307 54.5 713.85023
8465 Senegal 1056.4957 76.4 58.90160
8516 Serbia 5689.8052 NA 419.03919
8567 Seychelles 9027.7301 NA 365.72463
8618 Sierra Leone 323.4532 66.4 43.85885
8669 Singapore 36757.5243 64.7 1501.33140
8720 Sint Maarten (Dutch part) NA NA NA
8771 Slovak Republic 16174.2284 59.5 1372.68818
8822 Slovenia 24101.2632 58.9 2174.71765
8873 Solomon Islands 1147.2437 37.5 71.91242
8924 Somalia NA 70.3 NA
8975 South Africa 5733.0410 55.0 485.43365
9026 South Sudan NA NA NA
9077 Spain 31891.3861 58.6 3075.01325
9128 Sri Lanka 2035.3030 54.2 83.61577
9179 St. Kitts and Nevis 10168.0093 NA 633.91741
9230 St. Lucia 5544.8437 63.0 442.57331
9281 St. Martin (French part) NA NA NA
9332 St. Vincent and the Grenadines 5357.2512 67.5 301.33061
9383 Sudan 1286.1473 52.3 94.59385
9434 Suriname 6255.2800 52.2 429.12629
9485 Swaziland 2512.9783 63.6 155.78038
9536 Sweden 43406.1748 64.8 4251.97636
9587 Switzerland 63524.6523 66.9 7140.72895
9638 Syrian Arab Republic 2691.5977 50.4 72.00606
9689 Tajikistan 733.8741 67.0 37.99971
9740 Tanzania 502.8518 88.4 25.31169
9791 Thailand 3838.2272 72.9 167.70008
9842 Timor-Leste 543.6922 71.0 73.24377
9893 Togo 534.8508 74.4 28.93053
9944 Tonga 3149.9735 64.6 161.18296
9995 Trinidad and Tobago 14684.0532 66.1 1069.21297
10046 Tunisia 4168.9509 48.0 239.96116
10097 Turkey 8553.7415 46.8 570.97184
10148 Turkmenistan 3710.4536 68.0 77.06955
10199 Turks and Caicos Islands NA NA NA
10250 Tuvalu 2609.9266 NA 290.02138
10301 Uganda 488.2459 84.5 42.54540
10352 Ukraine 2545.4803 58.1 179.56664
10403 United Arab Emirates 33183.1701 77.6 1520.05855
10454 United Kingdom 35163.4149 62.2 3285.05021
10505 United States 45744.5596 65.0 7410.16301
10556 Uruguay 9364.1241 64.1 698.16329
10607 Uzbekistan 1181.8601 64.6 62.15114
10658 Vanuatu 2635.3138 83.9 105.83174
10709 Venezuela, RB 11490.0291 66.0 686.34793
10760 Vietnam 1129.2889 71.9 79.71064
10811 Virgin Islands (U.S.) NA 59.5 NA
10862 West Bank and Gaza NA 42.8 NA
10913 Yemen, Rep. 1130.1833 46.8 64.00204
10964 Zambia 1006.3882 69.2 47.05637
11015 Zimbabwe 467.8534 66.8 NA
infmortality
50 103.2
101 17.2
152 32.0
203 NA
254 3.1
305 99.9
356 7.3
407 12.7
458 18.4
509 NA
560 4.2
611 3.6
662 41.1
713 14.0
764 8.9
815 40.0
866 17.1
917 4.5
968 3.6
1019 14.9
1070 74.7
1121 NA
1172 45.6
1223 43.4
1274 7.5
1325 37.1
1376 18.4
1427 5.9
1478 11.1
1529 93.3
1580 88.5
1631 45.6
1682 85.2
1733 5.2
1784 29.9
1835 NA
1886 106.8
1937 99.6
1988 NA
2039 7.7
2090 16.8
2141 17.1
2192 64.0
2243 112.8
2294 60.9
2345 8.8
2396 87.3
2447 4.9
2498 4.7
2549 NA
2600 3.3
2651 3.3
2702 3.4
2753 74.1
2804 11.5
2855 23.2
2906 18.2
2957 20.0
3008 15.0
3059 82.4
3110 43.7
3161 4.8
3212 69.5
3263 NA
3314 15.4
3365 2.5
3416 3.5
3467 NA
3518 55.2
3569 57.8
3620 20.5
3671 3.5
3722 51.3
3773 NA
3824 3.5
3875 NA
3926 9.4
3977 NA
4028 25.9
4079 83.6
4130 93.1
4181 26.2
4232 59.3
4283 21.2
4334 NA
4385 5.7
4436 1.7
4487 49.5
4538 28.1
4589 22.8
4640 31.9
4691 3.4
4742 NA
4793 3.7
4844 3.2
4895 20.7
4946 2.4
4997 18.9
5048 29.9
5099 56.3
5150 39.6
5201 26.4
5252 4.3
5303 NA
5354 9.6
5405 34.0
5456 44.3
5507 8.5
5558 19.4
5609 67.0
5660 77.6
5711 14.0
5762 1.8
5813 5.7
5864 2.2
5915 NA
5966 10.6
6017 44.9
6068 61.4
6119 5.6
6170 15.0
6221 100.6
6272 5.2
6323 22.8
6374 75.4
6425 13.2
6476 NA
6527 14.9
6578 34.9
6629 16.8
6680 3.4
6731 27.8
6782 7.5
6833 31.7
6884 94.6
6935 52.2
6986 31.6
7037 43.3
7088 3.8
7139 NA
7190 4.9
7241 23.7
7292 74.7
7343 90.4
7394 NA
7445 2.9
7496 8.4
7547 70.7
7598 15.1
7649 17.5
7700 47.7
7751 21.4
7802 16.1
7853 23.9
7904 5.4
7955 3.2
8006 NA
8057 7.0
8108 12.4
8159 9.8
8210 62.7
8261 17.5
8312 1.9
8363 53.6
8414 15.5
8465 50.8
8516 6.5
8567 11.7
8618 116.5
8669 2.1
8720 NA
8771 7.0
8822 2.5
8873 23.0
8924 108.3
8975 42.5
9026 NA
9077 4.0
9128 14.7
9179 7.0
9230 14.0
9281 NA
9332 19.2
9383 66.9
9434 27.3
9485 57.4
9536 2.4
9587 4.1
9638 14.3
9689 54.2
9740 52.5
9791 11.6
9842 49.2
9893 67.1
9944 13.8
9995 24.4
10046 14.8
10097 15.0
10148 48.0
10199 NA
10250 27.6
10301 65.0
10352 11.7
10403 6.4
10454 4.7
10505 6.6
10556 9.7
10607 44.5
10658 12.5
10709 16.1
10760 19.3
10811 NA
10862 20.7
10913 58.7
10964 71.5
11015 52.2
pairs(c2009[,2:5])
To make the scatter plot matrix (Figure 5-36), we’ll use columns 2 through 5—using the Name column wouldn’t make sense, and it would produce strange-looking results:
We didn’t use ggplot2 here because it doesn’t make scatter plot matrices (at least, not well).
You can also use customized functions for the panels. To show the correlation coefficient of each pair of variables instead of a scatter plot, we’ll define the function panel.cor. This will also show higher correlations in a larger font. Don’t worry about the details for now—just paste this code into your R session or script:
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
library(hexbin)
data(countries)
panel.cor <- function(x, y, digits=2, prefix="", cex.cor, ...) {
usr <- par("usr")
on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y, use="complete.obs"))
txt <- format(c(r, 0.123456789), digits=digits)[1]
txt <- paste(prefix, txt, sep="")
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * (1 + r) / 2)
}
#To show histograms of each variable along the diagonal, we’ll define panel.hist:
panel.hist <- function(x, ...) {
usr <- par("usr")
on.exit(par(usr))
par(usr = c(usr[1:2], 0, 1.5) )
h <- hist(x, plot = FALSE)
breaks <- h$breaks
nB <- length(breaks)
y <- h$counts
y <- y/max(y)
rect(breaks[-nB], 0, breaks[-1], y, col="white", ...)
}
Both of these panel functions are taken from the pairs help page, so if it’s more convenient, you can simply open that help page, then copy and paste. The last line of this version of the panel.cor function is slightly modified, however, so that the changes in font size aren’t as extreme as with the original.
Now that we’ve defined these functions we can use them for our scatter plot matrix, by telling pairs() to use panel.cor for the upper panels and panel.hist for the diagonal panels.
We’ll also throw in one more thing: panel.smooth for the lower panels, which makes a scatter plot and adds a LOWESS smoothed line, as shown in Figure 5-37. (LOWESS is slightly different from LOESS, which we saw in Recipe 5.6, but the differences aren’t important for this sort of rough exploratory visualization):
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
library(hexbin)
data(countries)
panel.cor <- function(x, y, digits=2, prefix="", cex.cor, ...) {
usr <- par("usr")
on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y, use="complete.obs"))
txt <- format(c(r, 0.123456789), digits=digits)[1]
txt <- paste(prefix, txt, sep="")
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * (1 + r) / 2)
}
panel.hist <- function(x, ...) {
usr <- par("usr")
on.exit(par(usr))
par(usr = c(usr[1:2], 0, 1.5) )
h <- hist(x, plot = FALSE)
breaks <- h$breaks
nB <- length(breaks)
y <- h$counts
y <- y/max(y)
rect(breaks[-nB], 0, breaks[-1], y, col="white", ...)
}
pairs(c2009[,2:5], upper.panel = panel.cor,
diag.panel = panel.hist,
lower.panel = panel.smooth)
library(gcookbook) # For the data set
library(ggplot2)
library(dplyr)
library(gridExtra)
library(plotly)
library(hexbin)
data(countries)
panel.lm <- function (x, y, col = par("col"), bg = NA, pch = par("pch"),
cex = 1, col.smooth = "black", ...) {
points(x, y, pch = pch, col = col, bg = bg, cex = cex)
abline(stats::lm(y ~ x), col = col.smooth, ...)
}
pairs(c2009[,2:5], pch=".",
upper.panel = panel.cor,
diag.panel = panel.hist,
lower.panel = panel.lm)
This time the default line color is black instead of red, though you can change it here (and with panel.smooth) by setting col.smooth when you call pairs(). We’ll also use small points in the visualization, so that we can distinguish them a bit better (Figure 5-38). This is done by setting pch=“.”:
The size of the points can also be controlled using the cex parameter. The default value for cex is 1; make it smaller for smaller points and larger for larger points. Values below .5 might not render properly with PDF output.
The ggpairs() function from the GGally package can also make scatter plot matrices.